Please check out the following link to view the HTML version of this page: http://htmlpreview.github.io/?https://github.com/kyloop/sb_capstone1/blob/master/Project%20I%20Milestone%20Report.html
E-commerce has grown rapidly in recent years, and many brick-and-mortar retail corporations (e.g. Walmart, Target) have joined the e-commerce competition. This project uses a dataset provided by Walmart to forecast weekly sales in 45 stores based on a set of features (e.g. store size, unemployment rate, Consumer Price Index (CPI)). The first part of the analysis reflects the past and current financial condition of the corporation, for either a single store or a particular department. It also explores how each feature impacts the business outcome, which is crucial operational knowledge for a corporation. In the second part of the analysis, a predictive time-series model will be implemented to forecast future sales on a weekly basis. This step is indispensable for a company that wants an estimate of the future direction of its profit, or even its stock price, and it helps the management team prepare a business strategy for the next period.
The primary audience for this analysis is the Walmart management team and executives; the findings from this research give them hints for designing their operating format and business plans. A second group of readers are professionals who follow Walmart's business condition; the findings offer them insight into current and future sales in the brick-and-mortar stores. A third group are those who want to find the sales patterns in a Walmart store or a particular department. The analysis also shows how seasonal factors change sales, which is valuable for professionals and merchants.
- Discover the sales distribution across stores and individual departments
- Discover the correlations between the features and sales
- Discover how the features contribute differently to the predictive models
- Determine the best model(s) for forecasting the weekly sales (continuous values)
- Determine how the models perform differently in various states
import pandas as pd
import numpy as np
trainDf = pd.read_csv("./data/train.csv")
feaDf = pd.read_csv("./data/features.csv").drop(["IsHoliday"], axis=1)
storesDf = pd.read_csv("./data/stores.csv")
df1 = pd.merge(trainDf, feaDf, on = ["Store", "Date"])
#print(df1.shape)
train = pd.merge(df1, storesDf, on = "Store")
print("Dataset size:",train.shape)
#Rename column names
train.columns = ['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday', 'Temperature',
'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4',
'MarkDown5', 'CPI', 'Unemployment', 'Store_Type', 'Size']
display(train.head(3))
Before we get into the analysis, we first deal with the missing values. As the output below shows, the missing values are concentrated in five features: MarkDown1, MarkDown2, MarkDown3, MarkDown4, and MarkDown5; no missing values are found among the other features. For this reason, we dig deeper into those features to check whether the missing values follow a pattern that would let us impute them.
The following tables show the proportion of missing values for every store. Based on Table 1, the proportion of missing values among these five features ranges from 63% to 90%. Given these high volumes, we dig further to check whether the values are missing at random. Table 2 shows that the records without missing values range from 2011-11-11 to 2012-10-26, and Table 3 shows that the records with missing values range from 2010-02-05 to 2011-11-04. We can therefore observe that the large volume of missing values is NOT missing at random. Considering the volume as well as the missing pattern, I propose dropping the five features (i.e. columns) instead of the records (i.e. rows), because keeping as many records as possible is important for the time-series analysis.
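The per-store missing proportions in Table 1 can be computed compactly with `isnull().mean()`; a minimal sketch on a toy frame (the values are hypothetical, not the Walmart figures):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged `train` table:
# MarkDown1 is mostly missing, CPI is complete.
df = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2],
    "MarkDown1": [np.nan, np.nan, 5.0, np.nan, 3.0],
    "CPI": [211.0, 211.1, 211.2, 126.0, 126.1],
})

# Fraction of missing values per column, and per store for MarkDown1
overall = df.isnull().mean()
per_store = df.groupby("Store")["MarkDown1"].apply(lambda s: s.isnull().mean())
print(overall["MarkDown1"])  # 0.6 (3 of 5 rows missing)
print(per_store[1])          # 2/3 for Store 1
```

On the real table the same two lines replace the counting loop below.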
cntLst = []
print("Table 1")
print('''Proportion of Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"''')
print()
cnt=0
for store in train.Store.unique():
for var in ["MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"]:
cntLst.append(train[var][train.Store == store].isnull().sum())
if cnt<=5:
print("Store #%d"%(store), end="")
print(" with %d instance has missing values"%(train[train.Store == store].shape[0]), end=" ")
print(cntLst, end=" ")
print("in percent", end=" ")
print(np.round(np.array(cntLst) / train[train.Store == store].shape[0], 2)*100, end="\n")
cnt+=1
elif cnt<=6:
print("...")
cnt+=1
else:
pass
cntLst=[]
cntLst = []
print("Table 2")
print('''Date range WITHOUT Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"''', end="\n")
print()
cnt=0
for store in train.Store.unique():
arr = sorted(pd.to_datetime(train.Date[train.MarkDown1.isnull() == False][train.Store == store].unique()))
if cnt<=5:
print("Store %d %s - %s"%(store,min(arr),max(arr)))
cnt+=1
elif cnt<=6:
print("...")
cnt+=1
else:
pass
cntLst = []
print("Table 3")
print('''Date range of Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"''', end="\n")
print()
cnt=0
for store in train.Store.unique():
arr = sorted(pd.to_datetime(train.Date[train.MarkDown1.isnull() == True][train.Store == store].unique()))
if cnt<=5:
print("Store %d %s - %s"%(store,min(arr),max(arr)))
cnt+=1
elif cnt<=6:
print("...")
cnt+=1
else:
pass
print("Table 4")
print('''Split the feature "Store_Type" into A,B,C and "IsHoliday" into 1,0 ''')
#display(pd.get_dummies(train[["IsHoliday_x"]]))
pd.concat((pd.get_dummies(train["Store_Type"]),pd.get_dummies(train["IsHoliday"])), axis=1).head(5)
import plotly.graph_objs as go
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.offline as offline
init_notebook_mode(connected=True)
namesLst=list(map(str,train.Store.unique()))
aa=[]
salesLst=[]
for name in namesLst:
aa=round(train.Weekly_Sales[train.Store == int(name)]/1000,2)
salesLst = salesLst + [aa]
traces=[]
for name, sales in zip (namesLst, salesLst):
traces.append(go.Box(
x=sales,
name=name
))
layout = go.Layout(
title = 'Range of Sales Values by each store (Figure 1)',
#yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
autosize=False,
width=700,
height=1000,
yaxis =dict(title = "Store Number",
exponentformat='e',
showexponent='all',
titlefont=dict(size=18),
tick0=5,ticks="outside",
dtick=1,
tickwidth=2,
showgrid=True),
xaxis = dict(title="Weekly Sales (in thousands)",
titlefont=dict(size=18),
zeroline=True, range=[-10,200], showgrid=True),
margin = dict(l=60,r=30, b=80, t=40),
showlegend=False
)
fig = go.Figure(data=traces, layout=layout)
iplot(fig, show_link=True)
namesLst=list(map(str,train.Store.unique()))
aa=[]
salesLst=[]
for name in namesLst:
aa=round(train.Temperature[train.Store == int(name)],2)
salesLst = salesLst + [aa]
traces=[]
for name, sales in zip (namesLst, salesLst):
traces.append(go.Box(
x=sales,
name=name
))
layout = go.Layout(
title = 'Range of Temperature Values by each store (Figure 2)',
#yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
autosize=False,
width=700,
height=1000,
yaxis =dict(title = "Store Number",
exponentformat='e',
showexponent='all',
titlefont=dict(size=18),
tick0=5,ticks="outside",
dtick=1,
tickwidth=2,
showgrid=True),
xaxis = dict(title="Temperature(F)",
titlefont=dict(size=18),
zeroline=True, range=[-5,120], showgrid=True),
margin = dict(l=60,r=30, b=80, t=40),
showlegend=False
)
fig = go.Figure(data=traces, layout=layout)
iplot(fig)
namesLst=list(map(str,train.Store.unique()))
aa=[]
salesLst=[]
for name in namesLst:
aa=round(train.Fuel_Price[train.Store == int(name)],2)
salesLst = salesLst + [aa]
traces=[]
for name, sales in zip (namesLst, salesLst):
traces.append(go.Box(
x=sales,
name=name
))
layout = go.Layout(
title = 'Range of Fuel Price Values by each store (Figure 3)',
#yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
autosize=False,
width=700,
height=1000,
yaxis =dict(title = "Store Number",
exponentformat='e',
showexponent='all',
titlefont=dict(size=18),
tick0=5,
ticks="outside",
dtick=1,
tickwidth=2,
showgrid=True),
xaxis = dict(title="Fuel Price($)",
titlefont=dict(size=18),
zeroline=True,
range=[2,5],
showgrid=True),
margin = dict(l=60,r=30, b=80, t=40),
showlegend=False
)
fig = go.Figure(data=traces, layout=layout)
iplot(fig)
namesLst=list(map(str,train.Store.unique()))
aa=[]
salesLst=[]
for name in namesLst:
aa=round(train.CPI[train.Store == int(name)],2)
salesLst = salesLst + [aa]
traces=[]
for name, sales in zip (namesLst, salesLst):
traces.append(go.Box(
x=sales,
name=name
))
layout = go.Layout(
title = 'Range of CPI Values by each store (Figure 4)',
#yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
autosize=False,
width=700,
height=1000,
yaxis =dict(title = "Store Number",
exponentformat='e',
showexponent='all',
titlefont=dict(size=18),
tick0=5,ticks="outside",
dtick=1,
tickwidth=2,
showgrid=True),
xaxis = dict(title="CPI",
titlefont=dict(size=18),
zeroline=True, range=[100,250], showgrid=True),
margin = dict(l=60,r=30, b=80, t=40),
showlegend=False
)
fig = go.Figure(data=traces, layout=layout)
iplot(fig)
namesLst=list(map(str,train.Store.unique()))
aa=[]
salesLst=[]
for name in namesLst:
aa=round(train.Unemployment[train.Store == int(name)],2)
salesLst = salesLst + [aa]
traces=[]
for name, sales in zip (namesLst, salesLst):
traces.append(go.Box(
x=sales,
name=name
))
layout = go.Layout(
title = 'Range of Unemployment Rate by each store (Figure 5)',
#yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
autosize=False,
width=700,
height=1000,
yaxis =dict(title = "Store Number",
exponentformat='e',
showexponent='all',
titlefont=dict(size=18),
tick0=5,ticks="outside",
dtick=1,
tickwidth=2,
showgrid=True),
xaxis = dict(title="Unemployment Rate",
titlefont=dict(size=18),
zeroline=True, range=[0,20], showgrid=True),
margin = dict(l=60,r=30, b=80, t=40),
showlegend=False
)
fig = go.Figure(data=traces, layout=layout)
iplot(fig)
Let's first take a close look at the trend of weekly sales over the observed period. The sales plot captures the trend from Feb 2010 to Oct 2012. As we can see, the weekly sales range approximately from 0 to 4 million. The sales trend for each store is flat over time except during the two holiday seasons, which run approximately from Nov 20 to Dec 31 in both years. This period covers two big holidays, Thanksgiving and Christmas, when retailers traditionally offer huge discounts to customers. During these high-selling periods, sales jump to 1.5-2 times their usual level, so we ought to pay extra attention to predictions for the holiday period.
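The 1.5-2x holiday lift can be quantified directly by comparing mean weekly totals under the `IsHoliday` flag; a minimal sketch on toy numbers (the values are made up for illustration):

```python
import pandas as pd

# Toy weekly totals in millions (hypothetical numbers):
sales = pd.DataFrame({
    "IsHoliday": [False, False, False, True, True],
    "Weekly_Sales": [1.0, 1.1, 0.9, 1.8, 2.2],
})

# Mean weekly sales by holiday flag, and the ratio between the two
means = sales.groupby("IsHoliday")["Weekly_Sales"].mean()
lift = means[True] / means[False]
print(round(lift, 2))  # 2.0 for these toy numbers
```

Applied to the real `train` table, the same ratio puts a single number on the holiday effect described above.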
import plotly.graph_objs as go
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.offline as offline
init_notebook_mode(connected=True)
#Reload Data
import pandas as pd
import numpy as np
trainDf = pd.read_csv("./data/train.csv")
feaDf = pd.read_csv("./data/features.csv").drop(["IsHoliday"], axis=1)
storesDf = pd.read_csv("./data/stores.csv")
df1 = pd.merge(trainDf, feaDf, on = ["Store", "Date"])
#print(df1.shape)
train = pd.merge(df1, storesDf, on = "Store")
print("Dataset size:",train.shape)
#Rename column names
train.columns = ['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday', 'Temperature',
'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4',
'MarkDown5', 'CPI', 'Unemployment', 'Store_Type', 'Size']
# Dropping Features
train.drop(columns=["MarkDown1","MarkDown2","MarkDown3","MarkDown4", "MarkDown5"], inplace = True)
#Gather Weekly Sales for each store
namesLst=list(map(str,train.Store.unique()))
dateLst = train.Date.unique()
aa=[]
salesLst=[]
timeLst=[]
storeSalesDict={}
for name in namesLst:
#print(name)
storeSalesDict[int(name)]=[]
for date in dateLst:
aa=round(train.Weekly_Sales[train.Store == int(name)][train.Date == date].sum()/ 1000000,2)
storeSalesDict[int(name)].append(aa)
traces=[]
for i, sales in storeSalesDict.items():
#print(i, sales)
traces.append(go.Scatter(
x= pd.to_datetime(dateLst),
y=sales,
name="Store#"+str(i)
))
import numpy as np
# Visibility mask for the trace-selection buttons: row i shows only trace i
aa = [[i == j for i in range(len(storeSalesDict))] for j in range(len(storeSalesDict))]
updatemenus = list([
dict(type="buttons",
active=-1,
buttons=list([
dict(label = 'Store 1',
method = 'update',
args = [{'visible': aa[0]},
{'title': 'Store 1 Sales'}])]))])
layout = go.Layout(
title = 'Weekly Sales (Figure 6)',
#yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
autosize=True,
#width=700,
#height=1000,
yaxis =dict(title = "Weekly Sales (in Millions)",
exponentformat='e',
showexponent='all',
titlefont=dict(size=18),
tick0=5,ticks="outside",
dtick=1,
tickwidth=2,
showgrid=True),
xaxis = dict(title="Time Series (in Months)",
titlefont=dict(size=18),
exponentformat='e',
showexponent='all',
zeroline=True,
showgrid=False,
rangeselector=dict(visible=True,
buttons=list([
dict(count=1,
label="1m",
step="month",
stepmode ="backward"),
dict(count=3,
label="3m",
step="month",
stepmode ="backward"),
dict(count=6,
label="6m",
step="month",
stepmode ="backward"),
dict(count=12,
label="12m",
step="month",
stepmode ="backward"),
dict(step ="all")])),
rangeslider = dict(visible=True)
),
margin = dict(l=60,r=30, b=80, t=40),
showlegend=False,
updatemenus=updatemenus
)
fig = go.Figure(data=traces, layout=layout)
iplot(fig, show_link=True)
Figure 7 below shows the total sales generated by each store type (i.e. A, B, C). Obviously, regular working days account for the major proportion of total sales, because holidays like Christmas and Thanksgiving make up only a small portion of the year. However, in Figure 8, we can observe that the average weekly sales on holidays are higher than on working days for Type A and B stores, while sales show essentially no difference for Type C stores. To determine their sales performance between holidays and non-holidays more precisely, I propose setting up a hypothesis test for further verification.
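One way to set up that hypothesis test is a two-sample (Welch's) t-test on mean weekly sales, holidays vs working days. A sketch on synthetic data (the distributions are assumptions for illustration, not the Walmart figures):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic weekly sales in millions: holiday weeks drawn with a higher mean
working = rng.normal(loc=1.0, scale=0.2, size=200)
holiday = rng.normal(loc=1.3, scale=0.2, size=40)

# Welch's t-test; H0: equal mean weekly sales on holidays and working days
t_stat, p_val = stats.ttest_ind(holiday, working, equal_var=False)
print(p_val < 0.05)  # True here, so H0 would be rejected
```

On the real data the two samples would be the per-week store-type totals split by `IsHoliday`.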
trace1 = []
salesA = train.Weekly_Sales[train.Store_Type=="A"][train.IsHoliday==False].sum()
salesB = train.Weekly_Sales[train.Store_Type=="B"][train.IsHoliday==False].sum()
salesC = train.Weekly_Sales[train.Store_Type=="C"][train.IsHoliday==False].sum()
trace1 = go.Bar(
x = sorted(list(set(train.Store_Type))),
y = [salesA, salesB, salesC],
name = "Working day"
)
salesA = train.Weekly_Sales[train.Store_Type=="A"][train.IsHoliday==True].sum()
salesB = train.Weekly_Sales[train.Store_Type=="B"][train.IsHoliday==True].sum()
salesC = train.Weekly_Sales[train.Store_Type=="C"][train.IsHoliday==True].sum()
trace2=[]
trace2 = go.Bar(
x = sorted(list(set(train.Store_Type))),
y = [salesA, salesB, salesC],
name = "Holiday"
)
data = [trace1, trace2]
layout = go.Layout(
title = "Store Type Sale (%s to %s) - Total (Figure 7) "% (min(train.Date), max(train.Date)),
barmode='group',
showlegend=True,
xaxis = dict(title = "Store Type"),
yaxis = dict(title = " Total Sales")
)
fig = go.Figure(data=data, layout=layout)
iplot(fig, show_link=True)
#salesA_len= len(list(train.Weekly_Sales[train.Store_Type=="A"][train.IsHoliday==False]))
salesA_len = len(list(set(train.Date[train.Store_Type=="A"][train.IsHoliday==False])))
salesA_tot = train.Weekly_Sales[train.Store_Type=="A"][train.IsHoliday==False].sum()
salesA_avg = salesA_tot/ salesA_len
salesB_len= len(list(set(train.Date[train.Store_Type=="B"][train.IsHoliday==False])))
salesB_tot = train.Weekly_Sales[train.Store_Type=="B"][train.IsHoliday==False].sum()
salesB_avg = salesB_tot/ salesB_len
salesC_len= len(list(set(train.Date[train.Store_Type=="C"][train.IsHoliday==False])))
salesC_tot = train.Weekly_Sales[train.Store_Type=="C"][train.IsHoliday==False].sum()
salesC_avg = salesC_tot/ salesC_len
trace1 = go.Bar(
x = sorted(list(set(train.Store_Type))),
y = [salesA_avg, salesB_avg, salesC_avg],
name = "Working day - " + str(salesA_len) + " days"
)
salesA_len= len(list(set(train.Date[train.Store_Type=="A"][train.IsHoliday==True])))
salesA_tot = train.Weekly_Sales[train.Store_Type=="A"][train.IsHoliday==True].sum()
salesA_avg = salesA_tot/ salesA_len
salesB_len= len(list(set(train.Date[train.Store_Type=="B"][train.IsHoliday==True])))
salesB_tot = train.Weekly_Sales[train.Store_Type=="B"][train.IsHoliday==True].sum()
salesB_avg = salesB_tot/ salesB_len
salesC_len= len(list(set(train.Date[train.Store_Type=="C"][train.IsHoliday==True])))
salesC_tot = train.Weekly_Sales[train.Store_Type=="C"][train.IsHoliday==True].sum()
salesC_avg = salesC_tot/ salesC_len
trace2 = go.Bar(
x = sorted(list(set(train.Store_Type))),
y = [salesA_avg, salesB_avg, salesC_avg],
name = "Holiday - "+str(salesA_len) + " days"
)
data = [trace1, trace2]
layout = go.Layout(
title = "Store Type Sale (%s to %s) - Weekly (Figure 8) "% (min(train.Date), max(train.Date)),
barmode='group',
showlegend=True,
xaxis = dict(title = "Store Type"),
yaxis = dict(title = " Weekly Sales ")
)
fig = go.Figure(data=data, layout=layout)
iplot(fig, show_link=True)
Figure 9 below shows the ranking of departments in terms of average weekly sales. The highest average weekly sales, about $75,000, are generated by Dept 92; the lowest, about -$7.68, by Dept 47. This plot demonstrates the sales disparity between departments.
sales_dept = []
dept_lst=[]
lst=[]
for dept in list(set(train.Dept)):
lst.append(tuple((dept,round(train.Weekly_Sales[train.Dept == dept].sum()/len(train.Weekly_Sales[train.Dept == dept]),2))))
df = pd.DataFrame(lst).sort_values(by=[1], ascending=True)
dept_lst = list(df.iloc[:,0])
sales_dept = list(df.iloc[:,1])
trace3 = []
trace3 = go.Bar(
x = list(df.iloc[:,1]),
y = list(map(str,list(df.iloc[:,0]))),
orientation="h"
#name = "Working day"
)
data = [trace3]
layout = go.Layout(
#barmode='group',
width=700,
height=1200,
#showlegend=True
title = 'Average Weekly Sales Ranking (Figure 9)',
yaxis =dict(title = "Department Number",
#exponentformat='e',
#showexponent='all',
titlefont=dict(size=18),
tick0=5,
ticks="outside",
dtick=1,
tickwidth=2,
showgrid=False,
type='category'),
xaxis = dict(title="Weekly Sales($)",
titlefont=dict(size=18),
zeroline=True,
#range=[2,5],
showgrid=True)
)
annotations=[]
for i in range(len(dept_lst)):
annotations.append(dict(x=sales_dept[i],
y=dept_lst[i],
text="%0.2f"%(sales_dept[i]),
font=dict(family='Arial',
size=12,
color='red'),
showarrow=True,
align = "center",
ax=40,
ay=0,
arrowhead=0))
layout['annotations'] = annotations
fig = go.Figure(data=data, layout=layout)
iplot(fig, show_link=True)
The following scatter plots roughly show the correlations between the predictor variables and the target variable.
Unfortunately, it is difficult to determine the intensity of the correlation of these features against Weekly Sales, except for Store Size (a larger store size seems to lead to higher weekly sales). To verify this further, hypothesis tests will be performed afterward.
'''*************************No need to Run*****************'''
wsLst=[]
tempLst=[]
fpLst=[]
cpiLst=[]
sizeLst=[]
storeLst=[]
dateLst=[]
for date in list(set(train.Date)):
print(date)
for store in list(set(train.Store)):
storeLst.append(store)
dateLst.append(date)
wsLst.append((train.Weekly_Sales[train.Store == store][train.Date == date]).sum())
tempLst.append(list(set(train.Temperature[train.Store == store][train.Date == date]))[0])
fpLst.append(list(set(train.Fuel_Price[train.Store == store][train.Date == date]))[0])
cpiLst.append(list(set(train.CPI[train.Store == store][train.Date == date]))[0])
sizeLst.append(list(set(train.Size[train.Store == store][train.Date == date]))[0])
pd.DataFrame(list(zip(storeLst,dateLst,wsLst, tempLst, fpLst, cpiLst, sizeLst)),
columns= ["store","date","Weekly_Sales", "Temperture", "Fuel_Price", "CPI", "Size"]).to_csv("Weekly_Sales_Corr2.csv")
'''****************Read the Weekly Sales Correlated Feature Table************'''
df = pd.read_csv("Weekly_Sales_Corr2.csv", )
ws=list(df.Weekly_Sales)
temp=list(df.Temperture)
fp=list(df.Fuel_Price)
cpi=list(df.CPI)
size=list(df.Size)
#df.head(5,)
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from numpy import arange,array,ones
from scipy import stats
import seaborn as sns
plt.figure(figsize=(10,10))
sns.regplot(x="Fuel_Price", y="Weekly_Sales", data=df)
plt.title("Fuel Price vs Weekly_Sales", fontsize=20)
plt.show()
plt.figure(figsize=(10,10))
sns.regplot(x="Temperture", y="Weekly_Sales", data=df)
plt.title("Temperature vs Weekly_Sales", fontsize=20)
plt.show()
plt.figure(figsize=(10,10))
sns.regplot(x="CPI", y="Weekly_Sales", data=df)
plt.title("CPI vs Weekly_Sales", fontsize=20)
plt.show()
plt.figure(figsize=(10,10))
sns.regplot(x="Size", y="Weekly_Sales", data=df)
plt.title("Store Size vs Weekly_Sales", fontsize=20)
plt.show()
With the hint from the correlation plots above, store size (independent variable) correlates the most with store weekly sales (dependent variable). To verify this, we show the correlation values numerically in the following heatmap. The values reflect the same result: store size is the only feature with a correlation value higher than 0.5, while the rest of the features show very little correlation with the dependent variable Weekly_Sales. In this case, I suggest comparing the prediction results of models built with and without the least correlated features. Hypothesis tests will be performed to verify these results.
As for the correlations between the independent variables, we can observe that none of them are highly correlated with each other, so I would not consider removing any of the features at this point. As above, hypothesis tests will be performed to verify these results.
import matplotlib.pyplot as plt
import seaborn as sns
df2 = df.copy()
df2.drop(columns=["Unnamed: 0","store","date"], inplace=True)
df2_corr =df2.corr()
sns.heatmap(df2_corr, linewidths=0.5, annot=True)
plt.xticks(rotation=45)
plt.title("Correlation Heatmap", fontsize=15)
plt.show()
According to the following hypothesis tests for the features Temperature, CPI and Store Size, each Z score falls outside the ±3 threshold (e.g. Z = -5 < -3 or Z = 65 > 3), meaning the observed correlation lies more than 3 standard deviations from the mean of the permutation null distribution. In other words, there is more than a 99.7% chance that the observed correlation did not arise by chance. We therefore have strong evidence to reject the null hypotheses and conclude that all of the features, except Fuel_Price, correlate with store weekly sales. This result also aligns with the correlation values in the heatmap above (e.g. the correlation of Fuel_Price vs Weekly_Sales is 0.0095). As mentioned, I would not remove any features during the initial machine learning stage even though some features have low correlation with Weekly_Sales (the dependent variable). In addition, we may want to compare the performance of models built with and without the least correlated features.
#Calculate pearson correlation function
def pearson_r(x,y):
corr_mat = np.corrcoef(x,y)
return corr_mat[0,1]
# Despite its name, this returns a z-score: how many standard deviations
# `value` lies from the mean of `array` (the permutation null distribution)
def p_value(array, value):
    return (value - np.mean(array))/np.std(array)
xi = temp
yi = ws
No_bootstrap_trial = len(ws)
coef_lst=[]
for i in range(No_bootstrap_trial):
x = np.random.permutation(xi)
y = np.random.permutation(yi)
coef_lst.append(np.corrcoef(x,y)[0,1])
org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Temperature vs Weekly Sales is", p_val[0,1])
xi = fp
yi = ws
No_bootstrap_trial = len(ws)
coef_lst=[]
for i in range(No_bootstrap_trial):
x = np.random.permutation(xi)
y = np.random.permutation(yi)
coef_lst.append(np.corrcoef(x,y)[0,1])
org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Fuel_Price vs Weekly Sales is", p_val[0,1])
xi = cpi
yi = ws
No_bootstrap_trial = len(ws)
coef_lst=[]
for i in range(No_bootstrap_trial):
x = np.random.permutation(xi)
y = np.random.permutation(yi)
coef_lst.append(np.corrcoef(x,y)[0,1])
org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for CPI vs Weekly Sales is", p_val[0,1])
xi = size
yi = ws
No_bootstrap_trial = len(ws)
coef_lst=[]
for i in range(No_bootstrap_trial):
x = np.random.permutation(xi)
y = np.random.permutation(yi)
coef_lst.append(np.corrcoef(x,y)[0,1])
org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Store Size vs Weekly Sales is", p_val[0,1])
According to the following hypothesis tests, the Z score for Store Size vs Temperature falls outside the ±3 threshold, i.e. the observed correlation lies more than 3 standard deviations from the mean of the permutation null distribution. We therefore have strong evidence to reject the null hypothesis and conclude that store size correlates with temperature. On the other hand, store size does not correlate with the other two features (CPI and Fuel Price), whose Z scores stay within the threshold. These results also align with the correlation values in the heatmap above.
xi = size
yi = cpi
No_bootstrap_trial = len(size)
coef_lst=[]
for i in range(No_bootstrap_trial):
x = np.random.permutation(xi)
y = np.random.permutation(yi)
coef_lst.append(np.corrcoef(x,y)[0,1])
org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Store Size vs CPI is", p_val[0,1])
xi = size
yi = fp
No_bootstrap_trial = len(size)
coef_lst=[]
for i in range(No_bootstrap_trial):
x = np.random.permutation(xi)
y = np.random.permutation(yi)
coef_lst.append(np.corrcoef(x,y)[0,1])
org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Store Size vs Fuel Price is", p_val[0,1])
xi = size
yi = temp
No_bootstrap_trial = len(size)
coef_lst=[]
for i in range(No_bootstrap_trial):
x = np.random.permutation(xi)
y = np.random.permutation(yi)
coef_lst.append(np.corrcoef(x,y)[0,1])
org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Store Size vs Temperature is", p_val[0,1])
According to the following hypothesis tests, the Z scores for Temperature vs Fuel Price and Temperature vs CPI fall outside the ±3 threshold, so we have strong evidence to reject the null hypotheses and conclude that temperature correlates with both fuel price and CPI. These results also align with the correlation values in the heatmap above.
xi = fp
yi = temp
No_bootstrap_trial = len(temp)
coef_lst=[]
for i in range(No_bootstrap_trial):
x = np.random.permutation(xi)
y = np.random.permutation(yi)
coef_lst.append(np.corrcoef(x,y)[0,1])
org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Temperature vs Fuel Price is", p_val[0,1])
xi = cpi
yi = temp
No_bootstrap_trial = len(temp)
coef_lst=[]
for i in range(No_bootstrap_trial):
x = np.random.permutation(xi)
y = np.random.permutation(yi)
coef_lst.append(np.corrcoef(x,y)[0,1])
org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Temperature vs CPI is", p_val[0,1])
Finally, the same permutation test is applied below to the last pair, Fuel Price vs CPI; the resulting Z score is compared against the same ±3 threshold and checked against the correlation values in the heatmap above.
xi = cpi
yi = fp
No_bootstrap_trial = len(fp)
coef_lst=[]
for i in range(No_bootstrap_trial):
x = np.random.permutation(xi)
y = np.random.permutation(yi)
coef_lst.append(np.corrcoef(x,y)[0,1])
org_coef = np.corrcoef(xi,yi)
p_val = p_value(coef_lst, org_coef)
print("The Z value for Fuel Price vs CPI is", p_val[0,1])
The hypothesis tests above give us some hints about how the independent features interact with each other and how the dependent and independent features correlate. The feature most correlated with weekly sales (the dependent variable) is store size, and the rest of the independent features are comparably less correlated. This gives us useful insight before the machine learning stage: based on this analysis, we can iterate over models with more feature combinations in order to find the most efficient and best-performing model.
Not all the predictor variables are informative for the analysis: MarkDown1-MarkDown5 were dropped from the data analysis and further predictive modeling due to their huge amount of missing values.
Most instances are retained even when they contain outlier values, because those data may reflect seasonal patterns in the target variable (Weekly Sales), e.g. store sales are consistently higher during holiday seasons.
Correlations: most features, except Fuel_Price, are correlated with the target variable, store weekly sales.
Correlations: the predictor variables are mostly correlated with each other, except Store Size vs Fuel Price.
import pandas as pd
import numpy as np
trainDf = pd.read_csv("./data/train.csv")
feaDf = pd.read_csv("./data/features.csv").drop(["IsHoliday"], axis=1)
storesDf = pd.read_csv("./data/stores.csv")
df1 = pd.merge(trainDf, feaDf, on = ["Store", "Date"])
#print(df1.shape)
train = pd.merge(df1, storesDf, on = "Store")
print("Dataset size:",train.shape)
#Rename column names
train.columns = ['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday', 'Temperature',
'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4',
'MarkDown5', 'CPI', 'Unemployment', 'Store_Type', 'Size']
display(train.head(3))
#Copy the original dataset, dropping the markdown columns
train2 = train.drop(["MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5"], axis=1)
###Dummy out the categorical variables
train2 = pd.concat((train2, pd.get_dummies(train2["IsHoliday"], prefix="IsHoliday")), axis=1)
train2 = pd.concat((train2, pd.get_dummies(train2["Store_Type"], prefix="Store_Type")), axis=1)
train2.drop(["IsHoliday", "Store_Type"], axis=1, inplace=True)
train2.columns
train2.head(3)
# load additional module
import pickle
storeLst=[]
for store_no in sorted(list(set(train2.Store))):
for date in sorted(list(set(train2.Date))):
item=train2[train2.Store==store_no][train2.Date == date]
storeLst.append(list([store_no,
date,
round(list(set(item.Temperature))[0],2),
round(list(set(item.Fuel_Price))[0],3),
round(list(set(item.CPI))[0],5),
round(list(set(item.Unemployment))[0],5),
list(set(item.Size))[0],
list(set(item.IsHoliday_False))[0],
list(set(item.IsHoliday_True))[0],
list(set(item.Store_Type_A))[0],
list(set(item.Store_Type_B))[0],
list(set(item.Store_Type_C))[0],
round(np.divide(item.Weekly_Sales.sum(), 1000000),5)]))
with open('train2.data', 'wb') as filehandle:
# store the data as binary data stream
pickle.dump(storeLst, filehandle)
import pandas as pd
import numpy as np
import pickle
with open('train2.data', 'rb') as filehandle:
# read the data as binary data stream
storeLst = pickle.load(filehandle)
train3=pd.DataFrame(storeLst, columns=["store_no", "date","temp","fuel_price","cpi", "Unemploy","store_size",
"holiday_f","holiday_t", "type_A", "type_B", "type_C","wkly_sales" ])
train3.head(3)
# Split the "YYYY-MM-DD" date string into year/month/day columns
new = train3.date.str.split("-", n=2, expand=True)
train3["year"] = new[0]
train3["month"] = new[1]
train3["day"] = new[2]
train3.head()
# Hold out everything after 2011-11-18 as the test set (time-based split)
train4 = train3[train3.date <= "2011-11-18"]
test4 = train3[train3.date > "2011-11-18"]
from sklearn import preprocessing

# data is a pandas DataFrame; returns sliding-window features (X) and the
# weekly sales aligned with the last week of each window (y), per store.
def generate_x_y(data, timestep, normalize=True):
    min_max_scaler = preprocessing.MinMaxScaler()
    ydata = []
    date_lst = []
    store_lst = []
    lst = []
    for store_no in sorted(data.store_no.unique()):
        store_data = data[data.store_no == store_no]
        # Target, date, and store aligned with the end of each window
        ydata.append(store_data["wkly_sales"].iloc[timestep - 1:])
        date_lst.append(store_data["date"].iloc[timestep - 1:])
        store_lst.append(store_data["store_no"].iloc[timestep - 1:])
        arr = np.array(store_data)
        reshape_length = timestep * arr.shape[1]
        # Flatten each length-`timestep` slice of rows into one feature row
        for i in range(arr.shape[0] - timestep + 1):
            s = arr[i:i + timestep]
            lst.append(s.reshape(reshape_length))
    df = pd.DataFrame(lst, columns=list(data.columns) * timestep)
    # Dropping by label removes every timestep copy of these columns
    xdata = df.drop(["store_no", "date", "wkly_sales"], axis=1)
    ydata = np.array(ydata).flatten()
    date_lst = np.array(date_lst).flatten()
    store_lst = np.array(store_lst).flatten()
    # Min-max scale the feature matrix
    if normalize:
        normalized_xdata = min_max_scaler.fit_transform(xdata)
        return (normalized_xdata, np.array(ydata), date_lst, store_lst)
    else:
        return (np.array(xdata), np.array(ydata), date_lst, store_lst)
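To make the window shapes concrete: for one store with `n` rows and `f` columns, the loop above produces `n - timestep + 1` flattened rows of `timestep * f` values each. A self-contained sketch of that core step on a hypothetical array (not the project's dataset):

```python
import numpy as np

def make_windows(arr, timestep):
    """Flatten each length-`timestep` slice of rows into one feature row."""
    n, f = arr.shape
    return np.array([arr[i:i + timestep].reshape(timestep * f)
                     for i in range(n - timestep + 1)])

arr = np.arange(12).reshape(6, 2)   # 6 weeks, 2 features per week
X = make_windows(arr, timestep=3)
print(X.shape)  # (4, 6): 6-3+1 windows, each 3 weeks * 2 features wide
```

Note that `generate_x_y` fits a fresh `MinMaxScaler` on whichever frame it is given, so calling it on `test4` scales the test set with its own statistics rather than the training set's.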
import matplotlib.pyplot as plt

# Plot train vs. test scores across the range of timesteps
def scorePlot(window_slide_lst, train_score_lst, test_score_lst, modelName):
    df = pd.DataFrame(list(zip(window_slide_lst,
                               train_score_lst,
                               test_score_lst)),
                      columns=["timestep", "train_score", "test_score"])
    df = df.set_index("timestep")
    df.plot(use_index=True, figsize=(10, 5))
    plt.title(modelName)
    plt.ylabel("Score")
    plt.xlabel("Timestep / Window_Slide")
    plt.xticks(window_slide_lst)
    plt.show()
from sklearn.metrics import r2_score, median_absolute_error, mean_absolute_error
from sklearn.metrics import mean_squared_error, mean_squared_log_error
from sklearn.linear_model import LinearRegression

train_reg_score_Lst = []
test_reg_score_Lst = []
test_reg_mse_Lst = []
window_slide_Lst = np.arange(2, 10)
# Iterate over timestep / window_slide values
for window_slide in window_slide_Lst:
    xtrain, ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest, ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)
    reg = LinearRegression().fit(xtrain, ytrain)
    a = reg.score(xtrain, ytrain)
    b = reg.score(xtest, ytest)
    c = mean_squared_error(ytest, reg.predict(xtest))
    train_reg_score_Lst.append(a)
    test_reg_score_Lst.append(b)
    test_reg_mse_Lst.append(c)
scorePlot(window_slide_Lst, train_reg_score_Lst, test_reg_score_Lst, "Linear Regression")
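One caveat with the setup above: `generate_x_y` fits its scaler separately on the test frame, so test-set statistics leak into the test features. A common alternative is to fit the scaler on training windows only and reuse it at test time, e.g. with a scikit-learn `Pipeline`. A minimal sketch on hypothetical random arrays (not the project's windows):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
xtrain, ytrain = rng.random((40, 6)), rng.random(40)
xtest = rng.random((10, 6))

# Scaler parameters are learned from xtrain only, then reused on xtest
model = make_pipeline(MinMaxScaler(), LinearRegression())
model.fit(xtrain, ytrain)
preds = model.predict(xtest)
print(preds.shape)  # one prediction per test row
```

Plugging the pipeline in would require `generate_x_y` to return unscaled features (`normalize=False`) and leave the scaling to the model object.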
from xgboost import XGBRegressor

window_slide_Lst = np.arange(2, 10)
train_xgb_score_Lst = []
test_xgb_score_Lst = []
test_xgb_mse_Lst = []
# Iterate over timestep / window_slide values
for window_slide in window_slide_Lst:
    xtrain, ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest, ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)
    xgb = XGBRegressor()
    xgb.fit(xtrain, ytrain)
    a = xgb.score(xtrain, ytrain)
    b = xgb.score(xtest, ytest)
    c = mean_squared_error(ytest, xgb.predict(xtest))
    train_xgb_score_Lst.append(a)
    test_xgb_score_Lst.append(b)
    test_xgb_mse_Lst.append(c)
scorePlot(window_slide_Lst, train_xgb_score_Lst, test_xgb_score_Lst, "XGB Regressor")
from sklearn.linear_model import LassoCV, RidgeCV

train_ridge_score_Lst = []
test_ridge_score_Lst = []
test_ridge_mse_Lst = []
# Iterate over timestep / window_slide values
for window_slide in window_slide_Lst:
    xtrain, ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest, ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)
    ridge = RidgeCV(cv=5, alphas=(0.001, 0.01, 0.1, 1.0, 10.0))
    ridge.fit(xtrain, ytrain)
    a = ridge.score(xtrain, ytrain)
    b = ridge.score(xtest, ytest)
    c = mean_squared_error(ytest, ridge.predict(xtest))
    train_ridge_score_Lst.append(a)
    test_ridge_score_Lst.append(b)
    test_ridge_mse_Lst.append(c)
scorePlot(window_slide_Lst, train_ridge_score_Lst, test_ridge_score_Lst, "Ridge")
train_lasso_score_Lst = []
test_lasso_score_Lst = []
test_lasso_mse_Lst = []
# Iterate over timestep / window_slide values
for window_slide in window_slide_Lst:
    xtrain, ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest, ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)
    lasso = LassoCV(cv=5, alphas=(0.001, 0.01, 0.1, 1.0, 10.0))
    lasso.fit(xtrain, ytrain)
    a = lasso.score(xtrain, ytrain)
    b = lasso.score(xtest, ytest)
    c = mean_squared_error(ytest, lasso.predict(xtest))
    train_lasso_score_Lst.append(a)
    test_lasso_score_Lst.append(b)
    test_lasso_mse_Lst.append(c)
scorePlot(window_slide_Lst, train_lasso_score_Lst, test_lasso_score_Lst, "Lasso")
from sklearn import linear_model

train_sgd_score_Lst = []
test_sgd_score_Lst = []
test_sgd_mse_Lst = []
# Iterate over timestep / window_slide values
for window_slide in window_slide_Lst:
    xtrain, ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest, ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)
    # 'squared_error' is the current scikit-learn name for the squared loss
    sgd = linear_model.SGDRegressor(loss='squared_error', penalty='l2', alpha=0.00000001, max_iter=10000)
    sgd.fit(xtrain, ytrain)
    a = sgd.score(xtrain, ytrain)
    b = sgd.score(xtest, ytest)
    c = mean_squared_error(ytest, sgd.predict(xtest))
    train_sgd_score_Lst.append(a)
    test_sgd_score_Lst.append(b)
    test_sgd_mse_Lst.append(c)
scorePlot(window_slide_Lst, train_sgd_score_Lst, test_sgd_score_Lst, "SGD Regressor")
train_bay_score_Lst = []
test_bay_score_Lst = []
test_bay_mse_Lst = []
# Iterate over timestep / window_slide values
for window_slide in window_slide_Lst:
    xtrain, ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest, ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)
    bay = linear_model.BayesianRidge()
    bay.fit(xtrain, ytrain)
    a = bay.score(xtrain, ytrain)
    b = bay.score(xtest, ytest)
    c = mean_squared_error(ytest, bay.predict(xtest))
    train_bay_score_Lst.append(a)
    test_bay_score_Lst.append(b)
    test_bay_mse_Lst.append(c)
scorePlot(window_slide_Lst, train_bay_score_Lst, test_bay_score_Lst, "Bayesian Regressor")
from sklearn.ensemble import RandomForestRegressor

train_rfr_score_Lst = []
test_rfr_score_Lst = []
test_rfr_mse_Lst = []
# Iterate over timestep / window_slide values
for window_slide in window_slide_Lst:
    xtrain, ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest, ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)
    rfr = RandomForestRegressor(n_estimators=200)
    rfr.fit(xtrain, ytrain)
    a = rfr.score(xtrain, ytrain)
    b = rfr.score(xtest, ytest)
    c = mean_squared_error(ytest, rfr.predict(xtest))
    train_rfr_score_Lst.append(a)
    test_rfr_score_Lst.append(b)
    test_rfr_mse_Lst.append(c)
scorePlot(window_slide_Lst, train_rfr_score_Lst, test_rfr_score_Lst, "Random Forest Regressor")
from sklearn.ensemble import GradientBoostingRegressor

train_gbr_score_Lst = []
test_gbr_score_Lst = []
test_gbr_mse_Lst = []
# Iterate over timestep / window_slide values
for window_slide in window_slide_Lst:
    xtrain, ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest, ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)
    gbr = GradientBoostingRegressor(n_estimators=200)
    gbr.fit(xtrain, ytrain)
    a = gbr.score(xtrain, ytrain)
    b = gbr.score(xtest, ytest)
    c = mean_squared_error(ytest, gbr.predict(xtest))
    train_gbr_score_Lst.append(a)
    test_gbr_score_Lst.append(b)
    test_gbr_mse_Lst.append(c)
scorePlot(window_slide_Lst, train_gbr_score_Lst, test_gbr_score_Lst, "Gradient Boosting Regressor")
from sklearn.ensemble import AdaBoostRegressor

train_abr_score_Lst = []
test_abr_score_Lst = []
test_abr_mse_Lst = []
# Iterate over timestep / window_slide values
for window_slide in window_slide_Lst:
    xtrain, ytrain, train_date_lst, train_store_lst = generate_x_y(train4, timestep=window_slide, normalize=True)
    xtest, ytest, test_date_lst, test_store_lst = generate_x_y(test4, timestep=window_slide, normalize=True)
    abr = AdaBoostRegressor(n_estimators=400, learning_rate=0.001)
    abr.fit(xtrain, ytrain)
    a = abr.score(xtrain, ytrain)
    b = abr.score(xtest, ytest)
    c = mean_squared_error(ytest, abr.predict(xtest))
    train_abr_score_Lst.append(a)
    test_abr_score_Lst.append(b)
    test_abr_mse_Lst.append(c)
scorePlot(window_slide_Lst, train_abr_score_Lst, test_abr_score_Lst, "AdaBoost Regressor")
# Compare test-set MSE across all models and timesteps
df = pd.DataFrame(list(zip(window_slide_Lst,
                           test_reg_mse_Lst,
                           test_xgb_mse_Lst,
                           test_ridge_mse_Lst,
                           test_lasso_mse_Lst,
                           test_sgd_mse_Lst,
                           test_bay_mse_Lst,
                           test_rfr_mse_Lst,
                           test_gbr_mse_Lst,
                           test_abr_mse_Lst)),
                  columns=["timestep",
                           "LR", "xgb", "LR-Ridge", "LR-Lasso",
                           "SGD Regressor",
                           "Bayesian Regressor",
                           "Random Forest Regressor",
                           "Gradient Boosting Regressor", "AdaBoost Regressor"])
df = df.set_index("timestep")
df.plot(use_index=True, figsize=(10, 5))
plt.title("MSE Comparison")
plt.ylabel("MSE")
plt.xlabel("Timestep")
plt.xticks(window_slide_Lst)
plt.show()
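Beyond eyeballing the plot, the best (model, timestep) pair can be read off the MSE table programmatically. A minimal sketch on a hypothetical results frame in the same layout as the comparison frame above (the numbers below are illustrative, not the project's results):

```python
import pandas as pd

# Hypothetical test-MSE results, indexed by timestep like the frame above
df = pd.DataFrame({"timestep": [2, 3, 4],
                   "LR": [5.2, 4.8, 5.0],
                   "xgb": [3.1, 2.9, 3.4]}).set_index("timestep")

# Column of the global minimum MSE, then the row within that column
best_model = df.min().idxmin()
best_timestep = df[best_model].idxmin()
print(best_model, best_timestep, df.loc[best_timestep, best_model])
```

Applied to the real comparison frame, this turns the visual comparison into a single reproducible selection step.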